Automatic Identification of Research Articles from Crawled Documents

نویسندگان

  • Cornelia Caragea
  • Jian Wu
  • Kyle Williams
  • Sujatha Das
  • Madian Khabsa
  • Pradeep Teregowda
  • C. Lee Giles
چکیده

Online digital libraries that store and index research articles not only make it easier for researchers to search for scientific information, but also have been proven as powerful resources in many data mining, machine learning and information retrieval applications that require high-quality data. The quality of the data available in digital libraries highly depends on the quality of a classifier that identifies research articles from a set of crawled documents, which in turn depends, among other things, on the choice of the feature representation. The commonly used “bag of words” representation for document classification can result in prohibitively high dimensional input spaces and may not capture the specifics of research articles. In this paper, we propose novel features that result in effective and efficient classification models for automatic identification of research articles. Experimental results on two datasets compiled from the CiteSeer digital library show that our models outperform strong baselines using a significantly smaller number of features.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Feedback Prediction for Blogs

The last decade lead to an unbelievable growth of the importance of social media. Due to the huge amounts of documents appearing in social media, there is an enormous need for the automatic analysis of such documents. In this work, we focus on the analysis of documents appearing in blogs. We present a proof-of-concept industrial application, developed in cooperation with Capgemini Magyaroszág K...

متن کامل

A Corpus-based Study of Lexical Bundles in Discussion Section of Medical Research Articles

There has been increasing interest in utilizing corpora in linguistic research and pedagogy in recent years. Rhetorical organization of different sections of research articles may appear similar in various disciplines, but close examination may show subtle differences nonetheless. One of the features that has been at the center of attention especially in recent years is the idiomaticity of a di...

متن کامل

Fully Automatic Compilation of Portuguese-English and Portuguese-Spanish Parallel Corpora

This paper reports the fully automatic compilation of parallel corpora for Brazilian Portuguese. Scientific news texts available in Brazilian Portuguese, English and Spanish are automatically crawled from a multilingual Brazilian magazine. The texts are then automatically aligned at documentand sentence-level. The resulting corpora contain about 2,700 parallel documents totaling over 150,000 al...

متن کامل

Species Disambiguation for Biomedical Term Identification

An important task in information extraction (IE) from biomedical articles is term identification (TI), which concerns linking entity mentions (e.g., terms denoting proteins) in text to unambiguous identifiers in standard databases (e.g., RefSeq). Previous work on TI has focused on species-specific documents. However, biomedical documents, especially full-length articles, often talk about entiti...

متن کامل

Contextual information retrieval in research articles: Semantic publishing tools for the research community

In recent years, the dramatic increase in academic research publications has gained significant research attention. Research has been carried out exploring novel ways of providing information services using this research content. However, the task of extracting meaningful information from research documents remains a challenge. This paper presents our research work on developing intelligent inf...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014